Data Exploration

This is the first notebook of the analysis.

> Make sure you have executed the previous notebook, 0_Cover.ipynb; otherwise the packages might fail to load.

Objective

Before jumping into modeling, it is very important to explore the data. The following cells load the necessary packages and perform an exploratory data analysis (EDA) on the dataset.

At the end, the dataset is pickled into a compressed file, which is used by the later stages (Preparation, Train and Forecast).


Wind Time Series

This project uses a wind time series provided by the professor. The data comes in a columnar format: each record (row) holds the wind speed measured at one hour of the day across a series of days. The table below describes the layout:

| Column Index | Format | Description |
|---|---|---|
| 1 | $$0 \le hour \le 23$$ | Hour reference |
| 2 | $$1 \le day \le 31$$ | First day of the series |
| ... | ... | $n^{th}$ day of the series |
| 32 | $$1 \le day \le 31$$ | Last day of the series |

The cell below loads the dataset.
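The file name and on-disk format of the dataset are not stated in this text, so the following is only a minimal sketch of the loading step. It assumes a CSV layout and reads a small in-memory sample instead of the real file. The column names (`hour`, `day1`, ...) are borrowed from the feature table later in this notebook, and `load_wind_table` is a hypothetical helper, not the notebook's own function.

```python
import io

import pandas as pd

# Hypothetical sample: 3 hours x 3 days instead of the real 24 x 31 table.
RAW_CSV = """hour,day1,day2,day3
0,5.1,4.8,6.0
1,5.3,4.9,6.2
2,5.0,4.5,5.9
"""


def load_wind_table(source):
    """Read the wide wind table; `source` is a path or a file-like object."""
    table = pd.read_csv(source)
    # The hour reference is an integer, the day columns stay as floats.
    table["hour"] = table["hour"].astype(int)
    return table


wind = load_wind_table(io.StringIO(RAW_CSV))
print(wind.shape)
print(wind.dtypes)
```

With the real data, the same call would receive the path of the provided file in place of the `io.StringIO` object.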


Feature Analysis

Although the provided dataset could be reshaped into a "single feature" dataset, with wind speed as the only feature, it is also interesting to investigate how wind velocity behaves at every hour of the day. The table below shows how the wind speed was recorded.

| Index | Feature | Format | Description |
|---|---|---|---|
| 1 | hour | int | record hour |
| 2 | day1 | float | record wind speed |
| 3 | day2 | float | record wind speed |
| ... | ... | ... | ... |
| 32 | day31 | float | record wind speed |

The next cells analyse the wind speed behaviour visually, using multiple plots. To do that, the following steps were taken:

Preparing the dataset
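The exact preparation code is not reproduced in this text. As a hedged illustration of the "single feature" reshaping mentioned earlier, the sketch below turns the wide table into one chronological series with `DataFrame.melt`. The sample values and the names `wide`, `series_long` and `wind_speed` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical 2-hour x 3-day excerpt in the wide layout described above.
wide = pd.DataFrame(
    {"hour": [0, 1], "day1": [5.1, 5.3], "day2": [4.8, 4.9], "day3": [6.0, 6.2]}
)

# One row per (day, hour) pair, with wind speed as the only feature.
series_long = wide.melt(id_vars="hour", var_name="day", value_name="wind_speed")

# Turn the labels "day1", "day2", ... into the integers 1, 2, ...
series_long["day"] = (
    series_long["day"].str.replace("day", "", regex=False).astype(int)
)

# Sort chronologically so the rows form a single time series.
series_long = series_long.sort_values(["day", "hour"]).reset_index(drop=True)
print(series_long)
```

Keeping both the wide and the long form around makes it easy to switch between the hourly and daily perspectives used in the plots.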

Plotting the records

Thanks to the plotly package and this notebook, we can now plot the records interactively.

Notes

The plots already suggest that, although the underlying wind behaviour is the same, observing it from a different time perspective (by hour or by day) might be a good way to find correlations.

It also becomes clear that plotting the records alone is not sufficient to understand the statistical behaviour of the feature. The next cells address this issue.

Statistical Behaviour

At this point it is important to highlight that we are dealing with a time series. Because of that, the main statistical measurements depend on the chosen "point of view". For example, if we group the measurements by hour, we have 31 records for every hour (one for each day); hence any statistical measure is computed over a sample of 31 measurements.
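To make the hourly point of view concrete, let $x_{h,d}$ denote the wind speed recorded at hour $h$ on day $d$. This notation is introduced here for illustration only and does not appear in the original notebook. The mean and the sample standard deviation for hour $h$ are then:

$$\bar{x}_h = \frac{1}{31}\sum_{d=1}^{31} x_{h,d}, \qquad s_h = \sqrt{\frac{1}{30}\sum_{d=1}^{31}\left(x_{h,d}-\bar{x}_h\right)^2}$$

The divisor 30 assumes the sample ($n-1$) estimator, which is what pandas uses by default. The daily point of view swaps the roles of $h$ and $d$: the sums then run over the hourly records of a single day.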

Mean, Standard Deviation and Box Plot Analysis

The following cell creates a function that plots the data together with its summary statistics. Each figure contains two subplots. The first shows the average wind speed from one of two different perspectives (daily or hourly). The second is a very powerful statistical tool, the box plot, which allows the observation of important statistical measurements such as the median, the quartiles and the outliers.
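The plotting function itself is not reproduced in this text. As a rough sketch, the code below computes the numbers that such a figure summarises for one group of measurements: mean, standard deviation, quartiles and outliers. The helper `box_plot_summary` and the sample values are hypothetical. Outliers are flagged with the common 1.5 × IQR rule, and quartile conventions differ between libraries, so the values may not match a given plotting package exactly.

```python
import pandas as pd


def box_plot_summary(values):
    """Return the numbers a box plot displays for one group of measurements."""
    s = pd.Series(values, dtype=float)
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # 1.5 * IQR rule: points beyond these fences are treated as outliers.
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = s[(s < lower) | (s > upper)].tolist()
    return {
        "mean": s.mean(),
        "std": s.std(),  # sample standard deviation (ddof=1)
        "q1": q1,
        "median": median,
        "q3": q3,
        "outliers": outliers,
    }


# Hypothetical wind speeds for one hour across several days.
summary = box_plot_summary([4.0, 5.0, 5.0, 6.0, 7.0, 20.0])
print(summary)
```

Applying the helper to every row (or every column) of the wide table gives the hourly (or daily) statistics discussed in the next sections.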

Hourly Grouped

By transposing the dataset, we obtain an "hourly" point of view.
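A minimal sketch of the transposition, using a made-up 2-hour × 3-day excerpt. The names `wide`, `hourly_view` and `hourly_stats` are assumptions for the example. Once `hour` becomes the column axis, column-wise statistics describe each hour across all days.

```python
import pandas as pd

# Hypothetical excerpt in the wide layout: one row per hour, one column per day.
wide = pd.DataFrame(
    {"hour": [0, 1], "day1": [5.0, 7.0], "day2": [6.0, 9.0], "day3": [7.0, 8.0]}
)

# After transposing, each column is one hour and each row is one day.
hourly_view = wide.set_index("hour").T

# Column-wise statistics now summarise each hour over the available days.
hourly_stats = hourly_view.agg(["mean", "std"])
print(hourly_view)
print(hourly_stats)
```

The daily point of view needs no transposition: with hours as rows, the same column-wise statistics on `wide.set_index("hour")` summarise each day.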

Statistics

Daily Grouped

Statistics

Pickle Dataframe

This is the final step of this notebook.

Once completed, the analysed dataframe should be pickled into a compressed file.

This practice allows the next notebook to start from where this one left off.
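The pickling step could look roughly like the sketch below. The file name `wind_explored.pkl.gz` and the sample dataframe are hypothetical, and a temporary directory is used so the example leaves no files behind in the project. With the default `compression="infer"`, pandas picks gzip from the `.gz` suffix.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the analysed dataframe.
wind = pd.DataFrame({"hour": [0, 1], "day1": [5.1, 5.3], "day2": [4.8, 4.9]})

# Hypothetical file name; the ".gz" suffix lets pandas infer gzip compression.
path = os.path.join(tempfile.mkdtemp(), "wind_explored.pkl.gz")
wind.to_pickle(path)

# The next notebook can restore the same dataframe, dtypes included.
restored = pd.read_pickle(path)
print(restored.equals(wind))
```

Unlike CSV, a pickle preserves dtypes and the index, so the next notebook does not need to repeat any parsing.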